---
title: "Stress testing PCA, RDA, and PLS"
author: "Eric R. Scott"
output: 
  html_notebook: 
    code_folding: hide
    toc: yes
---

# To-do

Rather than focus on p-values, do something like a confusion matrix:

1) Set up a cutoff of loadings and/or VIP scores that makes a variable a significant contributor.
2) Quantify how often each method gets this right (was it a discriminating variable vs. was it detected as a discriminating variable?).
3) Summarize this as a proportion of false positives or some other index. (Ask Matt, or look at notes.)
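The false-positive tally in step 3 could be sketched in base R like this (the `classification_rates()` helper and the example detection pattern are hypothetical; the real version would threshold loadings or VIP scores from the fitted models):

```{r}
# Sketch: compare which variables were simulated as discriminating
# (`truth`) against which ones a cutoff flagged (`detected`).
classification_rates <- function(truth, detected) {
  fp <- sum(detected & !truth)  # flagged but not truly discriminating
  fn <- sum(!detected & truth)  # truly discriminating but missed
  list(
    confusion = table(truth = truth, detected = detected),
    false_pos_rate = fp / sum(!truth),
    false_neg_rate = fn / sum(truth)
  )
}

# Toy example: 5 of 35 variables truly discriminate; suppose a VIP cutoff
# flags 4 of them plus 2 of the 30 non-discriminating variables.
truth    <- c(rep(TRUE, 5), rep(FALSE, 30))
detected <- c(rep(TRUE, 4), FALSE, rep(TRUE, 2), rep(FALSE, 28))
rates <- classification_rates(truth, detected)
rates$false_pos_rate  # 2/30
```

Averaging these rates over the `nperm` simulated datasets would give one index per method.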
# Introduction
The purpose of this notebook is to stress-test PCA, RDA, and PLS-DA to better understand their properties.  All the datasets in this notebook have 35 variables and 20 observations split into two groups ("a" and "b"). I randomly generate `nperm` data frames using the same parameters, then obtain p-values from each statistical method on every dataset to test whether the groups differ.  Finally, I visualize the distribution of p-values.

# Packages needed

```{r message=FALSE, warning=FALSE}
library(chemhelper) # for sim_*() functions
library(tidyverse)
library(ropls) # I'm open to using pls or another package to fit PLS-DA models.
library(iheatmapr) # for interactive correlation heatmaps
# library(broom)
library(vegan) # for rda()
```

# Define Functions

## PCA t-test

Runs PCA, does a t-test on the PC1 scores, and returns the p-value.

```{r}
PCA.t.test <- function(df){
  pca.m <- opls(select(df, -group), plotL = FALSE, printL = FALSE)
  pca.d <- get_plotdata(pca.m)$plot_data
  pca.t <- t.test(pca.d$p1 ~ df$group)
  return(pca.t$p.value)
}
```

## RDA ANOVA

Fits an RDA model, runs a permutation ANOVA for the global significance of the model, and returns the p-value.

```{r}
RDA.anova <- function(df){
  rda.m <- rda(select(df, -group) ~ df$group)
  return(anova(rda.m)$`Pr(>F)`[[1]])
}
```

## PLS-DA

Fits a PLS-DA model, calculates p-values with 100 permutations, and returns $Q^2$, $p_{(R^2_Y)}$, and $p_{(Q^2)}$.

```{r}
PLS.test <- function(df){
  pls.m <- opls(select(df, -group), df$group, permI = 100, predI = 2, printL = FALSE, plotL = FALSE)
  return(select(pls.m@summaryDF, `Q2(cum)`, pR2Y, pQ2))
}
```



# 1. Many highly multicollinear discriminating variables contributing to a large effect size

This code creates a dataset with 30 discriminating variables that co-vary and 5 uncorrelated variables.

```{r}
base_df <- sim_df(20, 2)
nperm <- 50
```

```{r}
df1 <- map(1:nperm,
    ~base_df %>% 
      sim_covar(5, var = 1, cov = 0, name = "uncorr") %>% 
      sim_discr(30, var = 1, cov = 0.5, group_means = c(-0.5, 0.5), name = "discr")
) %>% set_names(paste("dataset", 1:nperm))

df1[[1]] %>% 
  select(-group) %>% 
  cor() %>% 
  iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
```

**PCA 1**

```{r}
pvals.pca1 <- map_dbl(df1, PCA.t.test)
```

**RDA 1**

```{r}
pvals.rda1 <- map_dbl(df1, RDA.anova)
```

**PLS 1**

```{r}
output.pls1 <- map(df1, PLS.test) %>% bind_rows()
```

## Results 1

```{r}
results1 <- output.pls1 %>% add_column(pvals.rda1, pvals.pca1)
results1 %>% summarize_all(list(mean = mean))
```

```{r}
pvals1 <- results1 %>% 
  select(`PLS pQ2` = pQ2, `RDA ANOVA` = pvals.rda1, `PCA t-test` = pvals.pca1) %>% 
  gather(key = "Method", value = "p value")

ggplot(pvals1, aes(x = `p value`, fill = Method)) +
  # geom_density(alpha = 0.3) +
  geom_histogram(bins = 20, position = "dodge") +
  geom_vline(aes(xintercept = 0.05), color = "red")

```

In this scenario, PLS gives a low p-value.  Since all 30 of the discriminating variables covary strongly with each other, I interpret this as PLS being highly robust to multicollinearity.  That is, there aren't *really* 30 discriminating variables, but one latent variable that affects all 30.  I imagine that if I instead created 30 discriminating variables in 6 groups of 5, the p-values of all methods would be similar.

# 2. Needle in a haystack---many highly collinear variables, few discriminating variables

```{r}
df2 <- map(1:nperm, ~base_df %>% 
  sim_covar(10, 1, 0.5, "corrA") %>% 
  sim_covar(10, 1, 0.5, "corrB") %>% 
  sim_covar(10, 1, 0.5, "corrC") %>% 
  sim_discr(5, 1, 0, group_means = c(-1, 1), name = "discr")
  )

df2[[2]] %>% 
  select(-group) %>% 
  cor() %>% 
  iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
```

**PCA 2**

```{r}
pvals.pca2 <- map_dbl(df2, PCA.t.test)
mean(pvals.pca2)
#> [1] 0.3589526
```

**RDA 2**

```{r}
pvals.rda2 <- map_dbl(df2, RDA.anova)
mean(pvals.rda2)
#> [1] 0.0019
```

**PLS 2**

```{r}
output.pls2 <- map(df2, PLS.test) %>% bind_rows()
```

## Results 2

```{r}
results2 <- output.pls2 %>% add_column(pvals.rda2, pvals.pca2)
results2 %>% summarize_all(list(mean = mean))
```

```{r}
pvals2 <- results2 %>% 
  select(`PLS pQ2` = pQ2, `RDA ANOVA` = pvals.rda2, `PCA t-test` = pvals.pca2) %>% 
  gather(key = "Method", value = "p value")

ggplot(pvals2, aes(x = `p value`, fill = Method)) +
  # geom_density(alpha = 0.3) + 
  geom_histogram(position = "dodge", bins = 20) +
  # ylim(0, 20) +
  # xlim(0, 0.5) +
  geom_vline(aes(xintercept = 0.05), color = "red")
```

Here PCA performs worse because the variables that contribute most to the overall variation in the dataset are not the ones that differ between groups. PLS-DA and RDA both consistently find differences between the groups.  However, these significant differences are due to only **5** variables.  To interpret these results, it's important to note how much variation is explained by the predictive axes of PLS or by the constrained portion of RDA.
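For RDA, that proportion can be pulled straight from the fitted object. Here is a toy sketch (assuming vegan stores the constrained and total inertia of an `rda` object in `$CCA$tot.chi` and `$tot.chi`; worth double-checking against your vegan version):

```{r}
library(vegan)

# Toy data: 20 observations, 5 variables, two groups with no built-in signal
set.seed(1)
toy <- as.data.frame(matrix(rnorm(20 * 5), nrow = 20))
grp <- rep(c("a", "b"), each = 10)

rda.m <- rda(toy ~ grp)

# Fraction of total variance captured by the constrained (group) axis
prop.constrained <- rda.m$CCA$tot.chi / rda.m$tot.chi
prop.constrained
```

A significant permutation test paired with a small `prop.constrained` would flag exactly the situation described above: a real but tiny between-group difference.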

# 3. Red Herring? (AKA Needle in a highly multicollinear haystack.)

OK, this is ridiculous, but here is a dataset with 30 strongly multicollinear variables that don't discriminate between groups and only 5 variables that do.

```{r}
df3 <- map(1:nperm,
             ~base_df %>% 
               sim_covar(30, 1, 0.7, name = "corr") %>% 
               sim_discr(5, 1, 0, name = "discr", group_means = c(-0.5, 0.5))
)

df3[[1]] %>% 
  select(-group) %>% 
  cor() %>% 
  iheatmap(row_labels = TRUE, col_labels = TRUE, name = "cor")
```

**PCA 3**

```{r}
pvals.pca3 <- map_dbl(df3, PCA.t.test)
mean(pvals.pca3)
#> [1] 0.4606713
```

**RDA 3**

```{r}
pvals.rda3 <- map_dbl(df3, RDA.anova)
mean(pvals.rda3)
#> [1] 0.19278
```

**PLS 3**

```{r}
output.pls3 <- map(df3, PLS.test) %>% bind_rows()
```

## Results 3

```{r}
results3 <- output.pls3 %>% add_column(pvals.rda3, pvals.pca3)
results3 %>% summarize_all(list(mean = mean))
meanQ <- mean(results3$`Q2(cum)`)
```

```{r}
pvals3 <- results3 %>% 
  select(`PLS pQ2` = pQ2, `RDA ANOVA` = pvals.rda3, `PCA t-test` = pvals.pca3) %>% 
  gather(key = "Method", value = "p value")

ggplot(pvals3, aes(x = `p value`, fill = Method)) +
  # geom_density(alpha = 0.3) + 
  geom_histogram(position = "dodge", bins = 20) +
  geom_vline(aes(xintercept = 0.05), color = "red")
```

Even in this ridiculous scenario, PLS is able to "find" the 5 discriminating variables most of the time and give a significant p-value.  The mean $Q^2$ value is `r round(meanQ, 2)`, so it's possible that some of those "significant" p-values came from models with poor predictive power.  Because within-group covariation is high and between-group covariation is low, the permutation F-test used to analyze the RDA results gives a high p-value.  I think.


